HELP International is an international humanitarian NGO committed to fighting poverty and providing people in underdeveloped countries with basic amenities and relief during disasters and natural calamities. It runs operational projects from time to time, alongside advocacy drives to raise awareness and funds.
After its recent funding programmes, the NGO has raised around $10 million. The CEO now needs to decide how to use this money strategically and effectively. The main difficulty in making this decision is identifying the countries in the direst need of aid.
To categorize/segment countries using socio-economic and health factors to identify which countries need financial assistance the most.
pip install kneed
Requirement already satisfied: kneed in /Users/griotinsights/anaconda3/lib/python3.10/site-packages (0.8.3)
Requirement already satisfied: scipy>=1.0.0 in /Users/griotinsights/anaconda3/lib/python3.10/site-packages (from kneed) (1.10.0)
Requirement already satisfied: numpy>=1.14.2 in /Users/griotinsights/anaconda3/lib/python3.10/site-packages (from kneed) (1.23.5)
Note: you may need to restart the kernel to use updated packages.
# DATA ANALYSIS AND VISUALIZATION LIBRARIES
import pandas as pd
import numpy as np
from random import sample
from numpy.random import uniform
from math import isnan
import seaborn as sns
import matplotlib.pyplot as plt
# MACHINE LEARNING LIBRARIES
import sklearn
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.neighbors import NearestNeighbors
from kneed import KneeLocator
import warnings
warnings.filterwarnings('ignore')
# IMPORTING EXCEL FILES
country_df = pd.read_excel("/Users/griotinsights/Desktop/DATASETS/HELP INTERNATIONAL/Country-data.xls")
country_df.head(10)
| | country | child_mort | exports | health | imports | income | inflation | life_expec | total_fer | gdpp |
|---|---|---|---|---|---|---|---|---|---|---|
0 | Afghanistan | 90.2 | 10.0 | 7.58 | 44.9 | 1610 | 9.440 | 56.2 | 5.82 | 553 |
1 | Albania | 16.6 | 28.0 | 6.55 | 48.6 | 9930 | 4.490 | 76.3 | 1.65 | 4090 |
2 | Algeria | 27.3 | 38.4 | 4.17 | 31.4 | 12900 | 16.100 | 76.5 | 2.89 | 4460 |
3 | Angola | 119.0 | 62.3 | 2.85 | 42.9 | 5900 | 22.400 | 60.1 | 6.16 | 3530 |
4 | Antigua and Barbuda | 10.3 | 45.5 | 6.03 | 58.9 | 19100 | 1.440 | 76.8 | 2.13 | 12200 |
5 | Argentina | 14.5 | 18.9 | 8.10 | 16.0 | 18700 | 20.900 | 75.8 | 2.37 | 10300 |
6 | Armenia | 18.1 | 20.8 | 4.40 | 45.3 | 6700 | 7.770 | 73.3 | 1.69 | 3220 |
7 | Australia | 4.8 | 19.8 | 8.73 | 20.9 | 41400 | 1.160 | 82.0 | 1.93 | 51900 |
8 | Austria | 4.3 | 51.3 | 11.00 | 47.8 | 43200 | 0.873 | 80.5 | 1.44 | 46900 |
9 | Azerbaijan | 39.2 | 54.3 | 5.88 | 20.7 | 16000 | 13.800 | 69.1 | 1.92 | 5840 |
country_df.shape
(167, 10)
country_df.isnull().sum()
country       0
child_mort    0
exports       0
health        0
imports       0
income        0
inflation     0
life_expec    0
total_fer     0
gdpp          0
dtype: int64
country_df.duplicated().sum()
0
country_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 167 entries, 0 to 166
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   country     167 non-null    object
 1   child_mort  167 non-null    float64
 2   exports     167 non-null    float64
 3   health      167 non-null    float64
 4   imports     167 non-null    float64
 5   income      167 non-null    int64
 6   inflation   167 non-null    float64
 7   life_expec  167 non-null    float64
 8   total_fer   167 non-null    float64
 9   gdpp        167 non-null    int64
dtypes: float64(7), int64(2), object(1)
memory usage: 13.2+ KB
According to the data dictionary, imports, exports and health are expressed as percentages of GDP per capita. Using these figures directly can skew the analysis: they suggest that countries such as Australia and Afghanistan spend similar shares on health (8.73% and 7.58%, respectively), which is misleading when their GDP per capita figures are far apart. Hence the need to convert them into absolute values.
country_df['exports'] = (country_df['exports']/100) * country_df['gdpp']
country_df['imports'] = (country_df['imports']/100) * country_df['gdpp']
country_df['health'] = (country_df['health']/100) * country_df['gdpp']
country_df.head()
| | country | child_mort | exports | health | imports | income | inflation | life_expec | total_fer | gdpp |
|---|---|---|---|---|---|---|---|---|---|---|
0 | Afghanistan | 90.2 | 55.30 | 41.9174 | 248.297 | 1610 | 9.44 | 56.2 | 5.82 | 553 |
1 | Albania | 16.6 | 1145.20 | 267.8950 | 1987.740 | 9930 | 4.49 | 76.3 | 1.65 | 4090 |
2 | Algeria | 27.3 | 1712.64 | 185.9820 | 1400.440 | 12900 | 16.10 | 76.5 | 2.89 | 4460 |
3 | Angola | 119.0 | 2199.19 | 100.6050 | 1514.370 | 5900 | 22.40 | 60.1 | 6.16 | 3530 |
4 | Antigua and Barbuda | 10.3 | 5551.00 | 735.6600 | 7185.800 | 19100 | 1.44 | 76.8 | 2.13 | 12200 |
It is the process of performing an initial investigation of the data to understand it by discovering trends, spotting anomalies and checking assumptions using statistical summaries and data visualizations.
In Univariate Analysis, only one variable is analyzed at a time. This analysis is used to describe the data and find patterns that exist within it.
country_df.describe()
| | child_mort | exports | health | imports | income | inflation | life_expec | total_fer | gdpp |
|---|---|---|---|---|---|---|---|---|---|
count | 167.000000 | 167.000000 | 167.000000 | 167.000000 | 167.000000 | 167.000000 | 167.000000 | 167.000000 | 167.000000 |
mean | 38.270060 | 7420.618847 | 1056.733204 | 6588.352108 | 17144.688623 | 7.781832 | 70.555689 | 2.947964 | 12964.155689 |
std | 40.328931 | 17973.885795 | 1801.408906 | 14710.810418 | 19278.067698 | 10.570704 | 8.893172 | 1.513848 | 18328.704809 |
min | 2.600000 | 1.076920 | 12.821200 | 0.651092 | 609.000000 | -4.210000 | 32.100000 | 1.150000 | 231.000000 |
25% | 8.250000 | 447.140000 | 78.535500 | 640.215000 | 3355.000000 | 1.810000 | 65.300000 | 1.795000 | 1330.000000 |
50% | 19.300000 | 1777.440000 | 321.886000 | 2045.580000 | 9960.000000 | 5.390000 | 73.100000 | 2.410000 | 4660.000000 |
75% | 62.100000 | 7278.000000 | 976.940000 | 7719.600000 | 22800.000000 | 10.750000 | 76.800000 | 3.880000 | 14050.000000 |
max | 208.000000 | 183750.000000 | 8663.600000 | 149100.000000 | 125000.000000 | 104.000000 | 82.800000 | 7.490000 | 105000.000000 |
country_df.columns
Index(['country', 'child_mort', 'exports', 'health', 'imports', 'income', 'inflation', 'life_expec', 'total_fer', 'gdpp'], dtype='object')
features =['child_mort', 'exports', 'health', 'imports', 'income', 'inflation',
'life_expec', 'total_fer', 'gdpp']
plt.figure(figsize=(12,12))
for idx, feature in enumerate(features):
    ax = plt.subplot(3, 3, idx + 1)
    sns.histplot(country_df[feature], kde=True)  # distplot is deprecated; histplot with kde is the modern equivalent
    plt.xticks(rotation=0)
plt.tight_layout()
plt.figure(figsize=(12,12))
for idx, feature in enumerate(features):
    ax = plt.subplot(3, 3, idx + 1)
    sns.boxplot(x=country_df[feature])
    plt.xticks(rotation=0)
plt.tight_layout()
In Multivariate Analysis, more than two different variables are analyzed. This analysis deals with causes and relationships and the analysis is done to find out the relationship between the variables.
# pairplot creates its own figure, so a preceding plt.figure() call would only produce an empty figure
sns.pairplot(country_df, corner=True)
plt.show()
# Heatmap to determine the correlation between the features.
plt.figure(figsize=(14,9))
sns.heatmap(country_df.corr(numeric_only=True), annot=True, cmap="YlGnBu")  # numeric_only excludes the 'country' text column
<Axes: >
## To find the degree of the relationship amongst the variables, a correlation function is used
country_df.corr(numeric_only=True)
| | child_mort | exports | health | imports | income | inflation | life_expec | total_fer | gdpp |
|---|---|---|---|---|---|---|---|---|---|
child_mort | 1.000000 | -0.297230 | -0.430438 | -0.319138 | -0.524315 | 0.288276 | -0.886676 | 0.848478 | -0.483032 |
exports | -0.297230 | 1.000000 | 0.612919 | 0.987686 | 0.725351 | -0.141553 | 0.377694 | -0.291096 | 0.768894 |
health | -0.430438 | 0.612919 | 1.000000 | 0.638581 | 0.690857 | -0.253956 | 0.545626 | -0.407984 | 0.916593 |
imports | -0.319138 | 0.987686 | 0.638581 | 1.000000 | 0.672056 | -0.179458 | 0.397515 | -0.317061 | 0.755114 |
income | -0.524315 | 0.725351 | 0.690857 | 0.672056 | 1.000000 | -0.147756 | 0.611962 | -0.501840 | 0.895571 |
inflation | 0.288276 | -0.141553 | -0.253956 | -0.179458 | -0.147756 | 1.000000 | -0.239705 | 0.316921 | -0.221631 |
life_expec | -0.886676 | 0.377694 | 0.545626 | 0.397515 | 0.611962 | -0.239705 | 1.000000 | -0.760875 | 0.600089 |
total_fer | 0.848478 | -0.291096 | -0.407984 | -0.317061 | -0.501840 | 0.316921 | -0.760875 | 1.000000 | -0.454910 |
gdpp | -0.483032 | 0.768894 | 0.916593 | 0.755114 | 0.895571 | -0.221631 | 0.600089 | -0.454910 | 1.000000 |
The above analysis has provided some insight into the relationships among the variables. The next steps are therefore to select the right features for the clustering analysis and then treat the outliers in those features.
Based on the objective, we are to select countries in need using socio-economic and health factors. Therefore, we need to know which features/variables fall under these factors.
| Health Factors | Socio-Economic Factors |
|---|---|
| Child Mortality | Exports |
| Health | Imports |
| Total Fertility | Income |
| Life Expectancy | Inflation |
| | GDP Per Capita |
From the box plots in the univariate analysis, both child mortality and income have outliers and are skewed to the right. In treating these outliers, we refrain from deletion, since that would exclude countries in the direst need of aid. Hence, there is no outlier treatment for child mortality.
For income, we are more focused on countries with low income per person, so we adopt winsorization (percentile capping) to treat the high values. We cap at the 96th percentile, meaning that values greater than the value at the 96th percentile are replaced by it.
### Capping income at the 96th percentile
max_income = country_df['income'].quantile(0.96)
max_income
print('Total number of rows getting capped for income : ', len(country_df[country_df['income']>max_income]))
Total number of rows getting capped for income : 7
country_df.loc[country_df['income'] > max_income, 'income'] = max_income  # .loc avoids chained-assignment pitfalls
country_df.income.max()
56255.99999999997
## Checking for outliers after capping
sns.boxplot(data=country_df, x="income")
plt.show()
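As a side note, the same one-sided capping can be expressed more compactly with `Series.clip`. This is a hypothetical sketch using synthetic right-skewed incomes as a stand-in for `country_df['income']`:

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed incomes standing in for country_df['income'] (assumption)
rng = np.random.default_rng(1)
income = pd.Series(rng.lognormal(mean=9, sigma=1, size=167))

cap = income.quantile(0.96)         # 96th-percentile cap, mirroring the step above
capped = income.clip(upper=cap)     # values above the cap are replaced by it

print(income.max(), "->", capped.max())
```

`clip(upper=...)` leaves the lower tail untouched, matching the one-sided winsorization applied here.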
Now that the features have been selected and outliers treated, we can go ahead and prepare and test the data for clustering.
The scales for the selected features are different hence the need to adjust the values and put them on a common scale.
df = country_df[['country','child_mort','income']]
cluster_df = country_df[['country','child_mort','income']].set_index('country')
features = cluster_df.columns
cluster_df
| country | child_mort | income |
|---|---|---|
Afghanistan | 90.2 | 1610.0 |
Albania | 16.6 | 9930.0 |
Algeria | 27.3 | 12900.0 |
Angola | 119.0 | 5900.0 |
Antigua and Barbuda | 10.3 | 19100.0 |
... | ... | ... |
Vanuatu | 29.2 | 2950.0 |
Venezuela | 17.1 | 16500.0 |
Vietnam | 23.3 | 4490.0 |
Yemen | 56.3 | 4480.0 |
Zambia | 83.1 | 3280.0 |
167 rows × 2 columns
## Scale the features in the new dataframe
scale=StandardScaler() # INITIALIZE
cluster_df_scaled = pd.DataFrame(scale.fit_transform(cluster_df))  ## fit and transform the data for the algorithm
cluster_df_scaled.columns=features
cluster_df_scaled.head()
| | child_mort | income |
|---|---|---|
0 | 1.291532 | -0.926860 |
1 | -0.538949 | -0.395492 |
2 | -0.272833 | -0.205808 |
3 | 2.007808 | -0.652873 |
4 | -0.695634 | 0.190163 |
The Hopkins statistic measures the clusterability of a dataset, i.e., whether it contains meaningful clusters. Values close to 1 indicate high clusterability, while values around 0.5 suggest uniformly random data.
def hopkins(X):
    d = X.shape[1]  # number of columns (dimensions)
    n = X.shape[0]  # number of rows (observations)
    m = int(0.1 * n)  # size of the randomly sampled dataset
    nbrs = NearestNeighbors(n_neighbors=1).fit(X.values)
    rand_X = sample(range(0, n, 1), m)
    ujd = []
    wjd = []
    for j in range(0, m):
        # draw uniformly from the space stretched from the minimum to the maximum point
        # and calculate the distance to the nearest neighbor
        u_dist, _ = nbrs.kneighbors(uniform(np.amin(X, axis=0), np.amax(X, axis=0), d).reshape(1, -1), 2, return_distance=True)
        ujd.append(u_dist[0][1])
        # sample a point from the dataset itself and calculate the distance to its nearest neighbor
        w_dist, _ = nbrs.kneighbors(X.iloc[rand_X[j]].values.reshape(1, -1), 2, return_distance=True)
        wjd.append(w_dist[0][1])
    H = sum(ujd) / (sum(ujd) + sum(wjd))
    if isnan(H):
        print(ujd, wjd)
        H = 0
    return H
hopkins(cluster_df_scaled)
0.9197042466225239
K-Means clustering is an unsupervised machine learning algorithm that groups an unlabeled dataset into distinct clusters.
The elbow method is used to decide on the optimal number of clusters.
inertia_scores=[] ## create an empty list to put all the inertia scores in once calculated
for i in range(1, 11):  ## for each candidate number of clusters from 1 to 10
    kmeans = KMeans(n_clusters=i, random_state=42)  ### initialize the algorithm
    kmeans.fit(cluster_df_scaled[['child_mort', 'income']])  ## fit the scaled data
    inertia_scores.append(kmeans.inertia_)  ### append the WCSS (inertia) for this k
plt.plot(range(1,11), inertia_scores, marker='o')
plt.title('The Elbow Method')
plt.ylabel("WCSS")
plt.xlabel('Number of Clusters')
plt.show()
k=KneeLocator(range(1,11), inertia_scores, curve='convex', direction='decreasing')
k.elbow
3
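As a cross-check on the elbow result (and to put the imported `silhouette_score` to use), silhouette analysis can be run over candidate values of k. This is a minimal sketch on synthetic blob data standing in for `cluster_df_scaled`:

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic 2-feature data standing in for the scaled child_mort/income columns (assumption)
X, _ = make_blobs(n_samples=167, centers=3, n_features=2, random_state=42)

scores = {}
for k in range(2, 7):  # silhouette requires at least 2 clusters
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)  # k with the highest average silhouette
print(scores, "-> best k:", best_k)
```

A higher silhouette (closer to 1) means tighter, better-separated clusters; agreement between the elbow and the silhouette peak strengthens the choice of k.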
## With the chosen number of clusters, we can fit the KMeans algorithm
kmeans=KMeans(n_clusters=3, random_state=50) ## initialize algorithm with optimal clusters
kmeans.fit(cluster_df_scaled)
kmeans.labels_
array([1, 2, 2, 1, 2, 2, 2, 0, 0, 2, 2, 0, 2, 2, 2, 0, 2, 1, 2, 2, 2, 2, 2, 0, 2, 1, 1, 2, 1, 0, 2, 1, 1, 2, 2, 2, 1, 1, 1, 2, 1, 2, 0, 0, 0, 2, 2, 2, 2, 1, 2, 2, 2, 0, 0, 2, 1, 2, 0, 1, 0, 2, 2, 1, 1, 2, 1, 2, 0, 2, 2, 2, 2, 0, 0, 0, 2, 0, 2, 2, 1, 1, 0, 2, 1, 2, 2, 1, 1, 0, 2, 0, 2, 1, 1, 2, 2, 1, 0, 1, 2, 2, 2, 2, 2, 2, 1, 1, 2, 2, 0, 0, 1, 1, 0, 0, 1, 2, 2, 2, 2, 2, 0, 0, 2, 2, 1, 2, 0, 1, 2, 2, 1, 0, 2, 0, 2, 2, 0, 0, 2, 2, 1, 2, 0, 0, 2, 1, 2, 1, 1, 2, 2, 2, 2, 1, 2, 0, 0, 0, 2, 2, 2, 2, 2, 2, 1], dtype=int32)
# Assign clustering result to each country in the data frame
cluster_df['Cluster Label']=kmeans.labels_
cluster_df
| country | child_mort | income | Cluster Label |
|---|---|---|---|
Afghanistan | 90.2 | 1610.0 | 1 |
Albania | 16.6 | 9930.0 | 2 |
Algeria | 27.3 | 12900.0 | 2 |
Angola | 119.0 | 5900.0 | 1 |
Antigua and Barbuda | 10.3 | 19100.0 | 2 |
... | ... | ... | ... |
Vanuatu | 29.2 | 2950.0 | 2 |
Venezuela | 17.1 | 16500.0 | 2 |
Vietnam | 23.3 | 4490.0 | 2 |
Yemen | 56.3 | 4480.0 | 2 |
Zambia | 83.1 | 3280.0 | 1 |
167 rows × 3 columns
cluster_df['Cluster Label'].value_counts(ascending=True)
0    38
1    41
2    88
Name: Cluster Label, dtype: int64
### Visualize to better understand clustering result
plt.figure(figsize=(10,8))
sns.scatterplot(data=cluster_df, x='child_mort',y='income',hue='Cluster Label', palette="Dark2")
plt.show()
On a development scale, the clusters can be classified into:
| Color | Cluster Label | Development Scale |
|---|---|---|
| Green | 0 | Developed Countries |
| Orange | 1 | Underdeveloped Countries |
| Purple | 2 | Developing Countries |
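One way to sanity-check these labels is to convert the cluster centers back into original units with the scaler's `inverse_transform`, so they read as actual mortality rates and income levels. A sketch with synthetic data standing in for the fitted `scale` and `kmeans` objects above:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Synthetic stand-in for the child_mort/income columns (assumption)
rng = np.random.default_rng(0)
raw = pd.DataFrame({
    "child_mort": rng.uniform(3, 150, 167),
    "income": rng.uniform(600, 60000, 167),
})

scaler = StandardScaler()
scaled = scaler.fit_transform(raw)
km = KMeans(n_clusters=3, random_state=50, n_init=10).fit(scaled)

# Undo the scaling so each center reads as a mortality rate and an income level
centers = pd.DataFrame(scaler.inverse_transform(km.cluster_centers_), columns=raw.columns)
print(centers.round(1))
```

The center with the highest child mortality and lowest income should correspond to the "underdeveloped" cluster, whatever its numeric label happens to be.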
According to the clusters above, the countries most in need of financial assistance are those in Cluster 1 (underdeveloped countries).
Underdeveloped = cluster_df[cluster_df['Cluster Label'] == 1].copy()  # copy to avoid a SettingWithCopyWarning on the later 'rank' assignment
Underdeveloped.head(5)
| country | child_mort | income | Cluster Label |
|---|---|---|---|
Afghanistan | 90.2 | 1610.0 | 1 |
Angola | 119.0 | 5900.0 | 1 |
Benin | 111.0 | 1820.0 | 1 |
Burkina Faso | 116.0 | 1430.0 | 1 |
Burundi | 93.6 | 764.0 | 1 |
Since Cluster 1 has 41 countries, we will narrow the selection down to the top 10 countries that need financial assistance the most. That is, countries with the highest child mortality rates and those with the lowest income levels.
# Rank countries in cluster 1 based on child mortality and income levels combined
Underdeveloped['rank'] = (Underdeveloped['income'].rank(ascending=True) + Underdeveloped['child_mort'].rank(ascending=False)).rank()
Underdeveloped["rank"].nsmallest(10)
country
Central African Republic     1.0
Congo, Dem. Rep.             2.0
Niger                        3.0
Sierra Leone                 4.0
Haiti                        5.0
Burundi                      7.0
Guinea                       7.0
Mozambique                   7.0
Guinea-Bissau                9.0
Burkina Faso                10.0
Name: rank, dtype: float64
plt.figure(figsize=(18,6))
sns.barplot(x = "country",
y='child_mort',
palette="rocket",
data=Underdeveloped.reset_index().nlargest(20, 'child_mort'))
plt.title("TOP 20 COUNTRIES WITH THE HIGHEST CHILD MORTALITY RATES")
plt.xticks(rotation = 45, horizontalalignment = "right")
plt.show()
plt.figure(figsize=(18,6))
sns.barplot(x='country',
y='income',
palette="rocket",
data=Underdeveloped.reset_index().nsmallest(20, 'income'))
plt.title("TOP 20 COUNTRIES WITH THE LOWEST INCOME LEVELS")
plt.xticks(rotation = 45, horizontalalignment = "right")
plt.show()